Computing tips for awesome research and an easy life

Robert Turner, University of Sheffield RSE Team, September 2021

Acknowledgements

Contains elements from Reproducible Research Data and Project Management in R, by Anna Krystalli, and from Methods in Research Software Engineering, by David Wilby.

About me

Bob Turner

Mix of software engineering and research experience.

RSE Team

RSE

13 RSEs, 35 projects / year worth ~£11m total

This presentation is flawed

Focusses on what to do, not how to do it.

In this session…

Discussion

What are the characteristics of well engineered research software?

Link to interactive doc

“Good” research software

  • Version control
  • Automated tests
  • Controlled execution environment
  • Documentation
  • Parameterised
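As a minimal sketch of what "parameterised" and "automated tests" can look like together, the Python below puts an analysis step in a function whose settings are arguments rather than hard-coded values, and pairs it with a small test. The function name, the threshold parameter and the numbers are illustrative assumptions, not taken from any particular project.

```python
from statistics import mean


def mean_above(values, threshold=0.0):
    """Return the mean of the values strictly greater than threshold.

    Hypothetical analysis step: the threshold is a parameter, so changing
    it does not require editing the body of the function.
    """
    kept = [v for v in values if v > threshold]
    if not kept:
        raise ValueError("no values above threshold")
    return mean(kept)


def test_mean_above():
    # A tiny input whose expected result is easy to check by hand.
    assert mean_above([1.0, 2.0, 3.0], threshold=1.5) == 2.5
```

A test runner such as pytest would pick up `test_mean_above` automatically; calling it directly also works.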

https://joss.readthedocs.io/en/latest/review_checklist.html

Data Management

Data Management Plan

  • Start early. Make an RDM plan before collecting data.
  • Anticipate data products as part of your thesis outputs.
  • Think about what technologies to use.

Own your data

Take initiative & responsibility. Think long term.

Spreadsheets?

Do you agree?

Excel

But good for data viewing / entry, sometimes, perhaps…

Databases

Have a look at the Data Carpentry SQL for Ecology lesson
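To give a flavour of what a database offers over a spreadsheet, here is a small sketch using Python's built-in sqlite3 module. The table name, column names and rows are hypothetical and only loosely modelled on survey-style data; they are not copied from the lesson.

```python
import sqlite3


def species_counts(rows):
    """Load (species, weight) rows into an in-memory SQLite table
    and return the number of records per species."""
    con = sqlite3.connect(":memory:")
    try:
        con.execute("CREATE TABLE surveys (species TEXT, weight REAL)")
        con.executemany("INSERT INTO surveys VALUES (?, ?)", rows)
        cur = con.execute(
            "SELECT species, COUNT(*) FROM surveys "
            "GROUP BY species ORDER BY species"
        )
        return cur.fetchall()
    finally:
        con.close()


# Hypothetical example rows.
rows = [("DM", 40.0), ("DM", 48.0), ("PF", 7.0)]
print(species_counts(rows))
```

The query states what is wanted (counts per species) and leaves the data itself untouched, which is harder to guarantee when summarising by hand in a spreadsheet.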

Data formats

  • .csv: comma separated values.
  • .tsv: tab separated values.
  • .txt: no formatting specified.

More unusual formats will need instructions on use.
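One reason to favour these plain-text formats is that almost any language can read and write them without extra software. The sketch below uses Python's standard csv module; the helper names and the example column names are assumptions made for illustration.

```python
import csv
import io


def to_delimited(rows, header, delimiter=","):
    """Serialise rows as delimited text.

    A comma gives .csv content; a tab character gives .tsv content.
    """
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter=delimiter, lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)
    return buffer.getvalue()


def from_delimited(text, delimiter=","):
    """Parse delimited text into a list of dicts keyed by the header row."""
    return list(csv.DictReader(io.StringIO(text), delimiter=delimiter))
```

Note that values come back as strings, so numeric columns need converting after reading.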

Ensure data is machine readable

Andrea De Santis, unsplash.com

bad

bad

good

ok

  • could help data entry
  • .csv or .tsv copy would need to be saved.

Basic quality control

Missing values are a fact of life, so use good null values:

  • Usually, the best solution is to leave the cell blank
  • NA or NULL are also good options
  • NEVER use 0. Avoid numbers like -999
  • Don’t make up your own code for missing values
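The points above can be illustrated with a short Python sketch that treats blank cells, NA and NULL as missing when reading a file, and leaves them out of a summary. The column name, the data and the helper names are hypothetical.

```python
import csv
import io
from statistics import mean

# Markers treated as missing in this sketch.
MISSING = {"", "NA", "NULL"}


def read_column(text, column):
    """Return one column as floats, with missing markers converted to None."""
    reader = csv.DictReader(io.StringIO(text))
    values = []
    for row in reader:
        cell = row[column].strip()
        values.append(None if cell in MISSING else float(cell))
    return values


def mean_ignoring_missing(values):
    """Average only the observed values."""
    return mean([v for v in values if v is not None])


# Hypothetical data: sites b and c have no depth recorded.
data = "site,depth_m\na,2.0\nb,\nc,NA\nd,4.0\n"
depths = read_column(data, "depth_m")
print(depths, mean_ignoring_missing(depths))
```

Had the gaps been recorded as -999 instead, they would parse as ordinary numbers and silently drag the mean far below zero, which is why placeholder numbers are best avoided.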

Data security

Raw data are sacrosanct

Give yourself less rope

  • It’s a good idea to revoke your own write permission to the raw data file.
  • Then you can’t accidentally edit it.
  • It also makes it harder to do manual edits in a moment of weakness, when you know you should just add a line to your data cleaning script.
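Revoking write permission can itself be scripted. The following is a minimal sketch using Python's os and stat modules; the helper names are assumptions. On Windows, os.chmod can only toggle the read-only flag, and administrator or root accounts may still be able to write to the file.

```python
import os
import stat

WRITE_BITS = stat.S_IWUSR | stat.S_IWGRP | stat.S_IWOTH


def make_read_only(path):
    """Remove write permission for user, group and others,
    leaving the other permission bits unchanged."""
    mode = stat.S_IMODE(os.stat(path).st_mode)
    os.chmod(path, mode & ~WRITE_BITS)


def is_writable_by_mode(path):
    """Return True if any write bit is set in the file's permission bits."""
    return bool(os.stat(path).st_mode & WRITE_BITS)
```

Calling `make_read_only` on a raw data file once, right after it is collected, means any later edit has to be a deliberate act rather than an accident.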

Photo by Jon Moore, unsplash.com

Know your main copies

  • identify the main copy of files
  • keep it safe and accessible
  • consider version control
  • consider centralising

source: Pexels CC0

How to avoid catastrophes

Backup: on disk
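A backup is only useful if the copy is intact. As a rough sketch, the Python below copies a file to a backup folder and compares SHA-256 checksums to confirm the copy matches the original. The function names and folder layout are illustrative assumptions, not a recommended tool.

```python
import hashlib
import shutil
from pathlib import Path


def sha256_of(path):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(1024 * 1024), b""):
            digest.update(chunk)
    return digest.hexdigest()


def backup_file(source, backup_dir):
    """Copy source into backup_dir and confirm the copy matches the original."""
    backup_dir = Path(backup_dir)
    backup_dir.mkdir(parents=True, exist_ok=True)
    target = backup_dir / Path(source).name
    shutil.copy2(source, target)  # copy2 also tries to preserve timestamps
    if sha256_of(source) != sha256_of(target):
        raise OSError(f"backup of {source} does not match the original")
    return target
```

For the backup to protect against disk failure, `backup_dir` would need to sit on a different physical disk from the main copy.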

Backup: in the cloud

  • Dropbox, Google Drive, etc.
  • if installed on your system, you can access them programmatically through R
  • some version control

Backup: the Open Science Framework osf.io

  • version controlled
  • easily shareable
  • works with other apps (e.g. Google Drive, GitHub)

Backup: GitHub

  • most solid version control.
  • keep everything in one project folder.
  • can be problematic with really large files.